DETAILED ACTION 
Response to Arguments 

Applicant's arguments, see Arguments/Remarks, filed December 5, 2007, with 
respect to the rejection(s) of claim(s) 1-31 under 35 U.S.C. §103 have been fully 
considered and are persuasive. Therefore, the rejection has been withdrawn. 
However, upon further consideration, a new ground(s) of rejection is made in view of 
Colbath ("Spoken Document: Creating Searchable Archives from Continuous Audio" 
IEEE 2000). 

Claim Rejections - 35 USC § 103 

The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

Claims 1-12 and 14-31 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Colbath ("Spoken Document: Creating Searchable Archives from 
Continuous Audio" IEEE 2000) in view of Liu ("Fast Speaker Change Detection for 
Broadcast News Transcription and Indexing"), and further in view of Stanford 
(5,475,792). 

As per claim 1, Colbath discloses a method for classifying an audio signal containing 
speech information, the method comprising: 

Receiving the audio signal (page 2, Component Technologies, audio wave file); 

Classifying the sound in the audio signal based on at least one non-phoneme 
based model (page 3, Speaker Segmentation, Clustering, and Identification, the 
speakers are classified by gender for segmentation and identification). 

Colbath does not disclose classifying a sound in the audio signal as a vowel class 
when a first phoneme-based model determines that the sound corresponds to a sound 
represented by a set of phonemes that define vowels, classifying the sound in the audio 
signal as a fricative class when a second phoneme-based model determines that the 
sound corresponds to a sound represented by a set of phonemes that define 
consonants, and classifying the sound in the audio signal based on at least one non- 
phoneme based model, the at least one non-phoneme model including at least one 
model for classifying the sound in the audio signal based on bandwidth. However, 
Colbath does disclose the use of a speech recognition component and a speaker 
segmentation component as part of a system designed to transform an audio wave file 
into an indexed database (page 2, Component Technologies), but does not provide 
further details on either component. Liu discloses classifying a sound in the audio signal 
as a vowel class when a first phoneme-based model determines that the sound 
corresponds to a sound represented by a set of phonemes that define vowels (section 
3, Phone-Class Decode and Figure 1), classifying the sound in the audio signal as a 
fricative class when a second phoneme-based model determines that the sound 
corresponds to a sound represented by a set of phonemes that define consonants 
(section 3, Phone-Class Decode and Figure 1). Liu discloses a fast speaker change 
detection algorithm for fast transcription and audio indexing of spoken broadcast news 
(Abstract). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use the phone-based models of Liu in Colbath, since it would result 
in improvements in speaker change detection accuracy, speech recognition accuracy 
and speed, as disclosed in Liu (Abstract). 

Additionally, Stanford discloses a speech recognition system that enables 
recognition of high bandwidth or telephony (low bandwidth) speech signals (column 2 
lines 30-32 and column 8 lines 36-44). Stanford states that low bandwidth speech 
reduces the accuracy of speech recognizers (column 3 lines 37-39), and discloses a 
system that trains and uses two separate codebook and phoneme models, one for low 
bandwidth speech and one for high bandwidth speech (column 8 lines 36-44). The 
addition of the low bandwidth recognition model improves recognition accuracy for low 
bandwidth input, such as telephone speech. 
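
For illustration only, the bandwidth-dependent model selection described above can be pictured as a routing step that labels a frame as low or high bandwidth and then hands it to the matching model. The sketch below is not taken from Stanford: the energy-ratio test, the 3400 Hz cutoff, the 0.05 threshold, and the names band_energy_ratio and select_model are all assumptions made for this example.

```python
import cmath
import math


def band_energy_ratio(frame, sample_rate, cutoff_hz):
    """Fraction of spectral energy above cutoff_hz (hypothetical helper)."""
    n = len(frame)
    total = 0.0
    high = 0.0
    # Only bins 0..n/2 are needed for a real-valued input frame.
    for k in range(n // 2 + 1):
        coeff = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n > cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def select_model(frame, sample_rate, cutoff_hz=3400.0, threshold=0.05):
    """Pick which model set to use; cutoff and threshold are assumed values."""
    ratio = band_energy_ratio(frame, sample_rate, cutoff_hz)
    return "high_bandwidth" if ratio > threshold else "low_bandwidth"
```

A frame whose energy sits almost entirely below the cutoff, as telephone speech typically does, would be routed to the low bandwidth models under these assumptions.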

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to classify the sound in the audio signal based on at least one non-phoneme 
based model, the at least one non-phoneme model including at least one 
model for classifying the sound in the audio signal based on bandwidth in Colbath, 
since one of ordinary skill in the art has good reason to pursue the options within his or 
her technical grasp in order to achieve the predictable result of accurately recognizing 
and segmenting audio information for indexing, regardless of the input quality. 

As per claim 2, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, and Colbath further discloses wherein the at least one non-phoneme 
based model includes models for classifying the sound in the audio signal 
based on speaker gender (page 3, Speaker Segmentation, Clustering, and Identification, 
the speakers are classified by gender for segmentation and identification). 

As per claim 3, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, however Colbath does not disclose wherein the at least one non-phoneme 
based model includes a model for classifying the sound in the audio signal as 
silence. Liu discloses wherein the at least one non-phoneme based model includes a 
model for classifying the sound in the audio signal as silence (Section 3 Phone-Class 
Decode, paragraph 3, silence model). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use a silence model of Liu as a non-phoneme based model in 
Colbath, since it is a much more effective model than energy based models for 
detecting non-speech regions, as indicated in Liu (page 3, section 3 Phone-Class 
Decode, paragraph 2), which in turn improves speaker change detection and speaker 
clustering or identification, as indicated in Liu (page 3, section 3, Phone-Class Decode). 



As per claims 4 and 5, Colbath in view of Liu, and further in view of Stanford disclose 
the method of claim 1, but Colbath does not disclose initially converting the audio signal 
into a frequency domain signal, and generating cepstral features for the audio signal. 
However, during standard speech processing, input speech is converted into a 
frequency domain representation by Fourier Transform, then converted into specific 
feature vectors, such as cepstral vectors. This is confirmed by Liu, which discloses the 
use of cepstral vectors for speaker change detection (section 4 Speaker Change 
Detection). 
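
As a minimal sketch of the processing chain referred to above (time-domain frame, Fourier transform, cepstral vector), the following computes a plain real cepstrum of one frame. It is offered under stated assumptions and is not the feature extraction of Colbath or Liu: the function names dft and real_cepstrum and the floor value are hypothetical, and practical systems usually add windowing and a mel filterbank, which are omitted here.

```python
import cmath
import math


def dft(frame):
    """Discrete Fourier transform of a real-valued frame (direct O(n^2) form)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of the frame.

    The floor (an assumed value) avoids taking log(0) for empty bins.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]
```

The low-order coefficients of such a vector summarize the spectral envelope, which is why cepstral vectors are a common choice of feature vector in speech processing.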

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to convert the audio signal to a frequency domain signal, and generate 
cepstral features in Colbath, since one of ordinary skill in the art has good reason to 
pursue the options within his or her technical grasp in order to achieve the predictable 
result of determining reliable feature vectors for speech processing. 

As per claim 6, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, however Colbath does not disclose wherein the fricative class 
includes phonemes that relate to fricatives and obstruents. Liu discloses wherein the 
fricative class includes phonemes that relate to fricatives and obstruents (Section 3 
Phone-Class Decode, paragraphs 3 and 4, fricatives are a specific type of obstruent, 
therefore it is inherent that the class includes phonemes that relate to obstruents). 

Therefore it would have been obvious to one of ordinary skill in the art at the time of the 
invention to have a fricative class that includes phonemes that relate to fricatives and 
obstruents in Colbath, since it would result in improvements in speaker change 
detection accuracy, speech recognition accuracy and speed, as disclosed in Liu 
(Abstract). 

As per claim 7, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, however Colbath does not disclose wherein the first and second 
phoneme-based models are Hidden Markov Models. However, Hidden Markov Models 
are statistical models commonly used as phoneme models in speech and language 
processing. In addition, Liu discloses wherein the first and second phoneme-based 
models are Hidden Markov Models (Section 3 Phone-Class Decode, paragraph 6). 
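
To make concrete how Hidden Markov Models can assign broad classes to a sequence of frames, the sketch below runs Viterbi decoding over three class states. It is an illustration under stated assumptions, not the model of Liu: the discrete observation symbols ("voiced", "noisy", "quiet"), every probability value, and the names viterbi and log_table are hypothetical, and real phone-class models would normally score continuous cepstral features.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] is the log-probability of the best path ending in s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


def log_table(table):
    """Convert a nested probability table to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# All values below are made up for the example.
STATES = ["vowel", "fricative", "silence"]
LOG_START = {s: math.log(1.0 / 3.0) for s in STATES}
LOG_TRANS = log_table(
    {a: {b: (0.6 if a == b else 0.2) for b in STATES} for a in STATES}
)
LOG_EMIT = log_table({
    "vowel": {"voiced": 0.8, "noisy": 0.1, "quiet": 0.1},
    "fricative": {"voiced": 0.1, "noisy": 0.8, "quiet": 0.1},
    "silence": {"voiced": 0.05, "noisy": 0.05, "quiet": 0.9},
})

labels = viterbi(
    ["quiet", "voiced", "voiced", "noisy", "quiet"],
    STATES, LOG_START, LOG_TRANS, LOG_EMIT,
)
print(labels)
```

With these assumed parameters the decoded labels follow the observations frame by frame, giving one broad class per frame.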

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use HMM for the phone-based models in Colbath, since one of 
ordinary skill in the art has good reason to pursue the options within his or her technical 
grasp in order to achieve the predictable result of determining robust and reliable 
phone-based models. 



Application/Control Number: Page 8 

10/685,585 

Art Unit: 2626 

As per claim 8, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, however Colbath does not disclose classifying the sound in the 
audio signal as a coughing class when the sound corresponds to a non-speech sound. 
Liu does not explicitly disclose classifying the sound in the audio signal as a coughing 
class when the sound corresponds to a non-speech sound, however Liu does disclose 
coughing as a common non-speech sound (Section 2 Evaluation Metrics, second 
paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to classify the sound in the audio signal as a coughing class when the 
sound corresponds to a non-speech sound in Colbath, since classification of coughing 
as non-speech would enable the exclusion of those frames during speaker clustering for 
identifying speakers, as taught by Liu (Section 3 Phone-Class Decode, first paragraph), 
thus improving speaker segmentation. 

As per claim 9, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 1, however Colbath does not disclose wherein the non-speech sound 
includes at least one of coughing, laughter, breath, and lip-smack. Liu discloses 
wherein the non-speech sound includes at least one of coughing, laughter, breath, and 
lip-smack (Section 2 Evaluation Metrics, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to identify non-speech sounds as coughing, laughter, breath and lip-smack 
in Colbath, since classification of non-speech frames enables the exclusion of 
those frames during speaker clustering for identifying speakers, as taught by Liu 
(Section 3 Phone-Class Decode, first paragraph), which improves speaker 
segmentation. 

As per claim 10, Colbath discloses a method of training audio classification models, the 
method comprising: 

Receiving a training audio signal (page 7-8, Annotation, training data); 

Receiving phoneme classes corresponding to the training audio signal (page 7-8, 
Annotation, training data). 

However, Colbath does not disclose training a first Hidden Markov Model (HMM), 
based on the training audio signal and the phoneme classes, to classify speech as 
belonging to a vowel class when the first HMM determines that the speech corresponds 
to a sound represented by a set of phonemes that define vowels, training a second 
HMM, based on the training audio signal and the phoneme classes, to classify speech 
as belonging to a fricative class when the second HMM determines that the speech 
corresponds to a sound represented by a set of phonemes that define consonants, and 
training at least one model to classify the sound based on a bandwidth of the sound. Liu 
discloses training a first Hidden Markov Model (HMM), based on the training audio 
signal and the phoneme classes, to classify speech as belonging to a vowel class when 
the first HMM determines that the speech corresponds to a sound represented by a set 
of phonemes that define vowels (Section 3 Phone-Class Decode, paragraphs 3 and 4), 
training a second HMM, based on the training audio signal and the phoneme classes, to 
classify speech as belonging to a fricative class when the second HMM determines that 
the speech corresponds to a sound represented by a set of phonemes that define 
consonants (Section 3 Phone-Class Decode, paragraphs 3 and 4). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use the phone-based models of Liu in Colbath, since it would result 
in improvements in speaker change detection accuracy, speech recognition accuracy 
and speed, as disclosed in Liu (Abstract). 

Additionally, Stanford discloses a speech recognition system that enables 
recognition of high bandwidth or telephony (low bandwidth) speech signals (column 2 
lines 30-32 and column 8 lines 36-44). Stanford states that low bandwidth speech 
reduces the accuracy of speech recognizers (column 3 lines 37-39), and discloses a 
system that trains and uses two separate codebook and phoneme models, one for low 
bandwidth speech and one for high bandwidth speech (column 8 lines 36-44). The 
addition of the low bandwidth recognition model improves recognition accuracy for low 
bandwidth input, such as telephone speech. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to train at least one model to classify the sound based on a bandwidth 
of the sound in Colbath, since one of ordinary skill in the art has good reason to pursue 
the options within his or her technical grasp in order to accurately recognize and 
segment audio information for indexing, regardless of the input quality of the audio. 

As per claim 11, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 10, however Colbath does not disclose wherein the phoneme classes 
include information that defines word boundaries. Liu further discloses wherein the 
phoneme classes include information that defines word boundaries (Section 5 
Experiments and Results, Word-Error-Rate (WER), the system determines the word 
error rate, or word recognition accuracy; therefore it is inherent that the phoneme 
classes include information on word boundaries). Liu also discloses that, for speaker 
change detection, it is important that speaker boundaries are not labeled in the middle 
of words (section 3 Phone-Class Decode). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to include information about word boundaries in the phoneme classes in 
Colbath, since it would increase speaker segmentation accuracy, as indicated in Liu 
(section 3 Phone-Class Decode). 

As per claim 12, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 11, however Colbath does not disclose receiving a sequence of 
transcribed words corresponding to the audio signal, and generating the information that 
defines the word boundaries based on the transcribed words. Liu discloses receiving a 
sequence of transcribed words corresponding to the audio signal (Section 2 Evaluation 
Metrics, last paragraph, reference transcription), and generating the information that 
defines the word boundaries based on the transcribed words (Section 2 Evaluation 
Metrics, last paragraph, the reference transcription is aligned with the acoustic data). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to transcribe words corresponding to the audio signal, and generate the 
information that defines the word boundaries based on the transcribed words in 
Colbath, since the alignment creates a more reliable ground truth for the evaluation of 
errors, as indicated in Liu (section 2 Evaluation Metrics). 

As per claim 14, this claim recites limitations similar to those recited in claim 2, and is 
therefore rejected for similar reasons. 

As per claim 15, this claim recites limitations similar to those recited in claim 6, and is 
therefore rejected for similar reasons. 

As per claim 16, Colbath discloses an audio classification device comprising: 

A decoder configured to classify portions of the audio signal as belonging to at 
least one of a plurality of classes (page 3, Speaker Segmentation, Clustering, and 
Identification, the speakers are classified by gender for segmentation and identification). 

However, Colbath does not disclose a signal analysis component configured to receive 
an audio signal and process the audio signal by at least one of converting the audio 
signal to the frequency domain and generating cepstral features for the audio signal, 
and wherein the classes include a first phoneme-based class that applies to the audio 
signal when a portion of the audio signal corresponds to a sound, represented by a set 
of phonemes that define vowels, a second phoneme-based class that applies to the 
audio signal when a portion of the audio signal corresponds to a sound represented by 
a set of phonemes that define consonants, and at least one non-phoneme class, 
wherein the decoder determines the at least one non-phoneme class using models that classify 
the portions of the audio signal based on bandwidth. However, during standard speech 
processing, input speech is converted into a frequency domain representation by 
Fourier Transform, then converted into specific feature vectors, such as cepstral 
vectors. This is confirmed by Liu, which discloses the use of cepstral vectors for 
speaker change detection (section 4 Speaker Change Detection). Liu also discloses 
classes that include a first phoneme-based class that applies to the audio signal when a 
portion of the audio signal corresponds to a sound, represented by a set of phonemes 
that define vowels, and a second phoneme-based class that applies to the audio signal 
when a portion of the audio signal corresponds to a sound represented by a set of 
phonemes that define consonants (Section 3 Phone-Class Decode, paragraphs 3 and 
4). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to convert the audio signal to a frequency domain signal and generate 
cepstral features in Colbath, since one of ordinary skill in the art has good reason to 
pursue the options within his or her technical grasp in order to achieve the predictable 
result of determining reliable feature vectors for speech processing. In addition, the use 
of a first and second phoneme-based model would result in improvements in speaker 
change detection accuracy, speech recognition accuracy and speed, as disclosed in Liu 
(Abstract). 

Additionally, Stanford discloses a speech recognition system that enables 
recognition of high bandwidth or telephony (low bandwidth) speech signals (column 2 
lines 30-32 and column 8 lines 36-44). Stanford states that low bandwidth speech 
reduces the accuracy of speech recognizers (column 3 lines 37-39), and discloses a 
system that trains and uses two separate codebook and phoneme models, one for low 
bandwidth speech and one for high bandwidth speech (column 8 lines 36-44). The 
addition of the low bandwidth recognition model improves recognition accuracy for low 
bandwidth input, such as telephone speech. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have the decoder determine the at least one non-phoneme class 
using models that classify the portions of the audio signal based on bandwidth in 
Colbath, since one of ordinary skill in the art has good reason to pursue the options 
within his or her technical grasp in order to accurately recognize and segment audio 
information for indexing, regardless of the input quality of the audio. 

As per claim 17, this claim recites limitations similar to those recited in claim 2, and is 
therefore rejected for similar reasons. 

As per claim 18, this claim recites limitations similar to those recited in claim 7, and is 
therefore rejected for similar reasons. 

As per claim 19, this claim recites limitations similar to those recited in claim 2, and is 
therefore rejected for similar reasons. 

As per claim 20, this claim recites limitations similar to those recited in claim 3, and is 
therefore rejected for similar reasons. 

As per claim 21, this claim recites limitations similar to those recited in claim 8, and is 
therefore rejected for similar reasons. 

As per claim 22, this claim recites limitations similar to those recited in claim 9, and is 
therefore rejected for similar reasons. 



As per claim 23, Colbath discloses a system comprising: 

An indexer configured to receive input audio data and generate a rich transcript 
from the audio data (page 2, second column, Rough'n'Ready audio indexing system), 
the indexer including: 

Audio classification logic configured to classify the input audio data into at least 
one of a plurality of broad audio classes, the broad audio classes including a non-phoneme 
based gender class (page 3 Speaker Segmentation, Clustering, and 
Identification, the speakers are classified by gender for segmentation and identification); 

A speech recognition component configured to generate the rich transcription 
based on the broad audio classes determined by the audio classification logic (page 2, 
Component Technologies, first paragraph); 

A memory system for storing the rich transcription; and a server configured to 
receive requests for documents and respond to the requests by transmitting one or 
more of the rich transcriptions that match the requests (page 4-5, System Architecture, 
server and browser); and 

A server configured to receive requests for documents and respond to the 
requests by transmitting one or more of the rich transcripts that match the requests 
(page 4-5, System Architecture, server and browser). 

However, Colbath does not disclose the broad audio classes including a phoneme- 
based vowel class, a phoneme-based fricative class, and a non-phoneme based 
bandwidth class. Liu discloses audio classification logic configured to classify the input 
audio data into at least one of a plurality of broad audio classes, the broad audio 
classes including a phoneme-based vowel class (Section 3 Phone-Class Decode, 
paragraphs 3 and 4), a phoneme-based fricative class (Section 3 Phone-Class Decode, 
paragraphs 3 and 4), and a non-phoneme based gender class (Section 3 Phone-Class 
Decode, paragraphs 3 and 4). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use the phone-based models of Liu in Colbath, since it would result 
in improvements in speaker change detection accuracy, speech recognition accuracy 
and speed, as disclosed in Liu (Abstract). 

Additionally, Stanford discloses a speech recognition system that enables 
recognition of high bandwidth or telephony (low bandwidth) speech signals (column 2 
lines 30-32 and column 8 lines 36-44). Stanford states that low bandwidth speech 
reduces the accuracy of speech recognizers (column 3 lines 37-39), and discloses a 
system that trains and uses two separate codebook and phoneme models, one for low 
bandwidth speech and one for high bandwidth speech (column 8 lines 36-44). The 
addition of the low bandwidth recognition model improves recognition accuracy for low 
bandwidth input, such as telephone speech. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have at least one broad audio class include a non-phoneme based 
bandwidth class in Colbath, since one of ordinary skill in the art has good reason to 
pursue the options within his or her technical grasp in order to accurately recognize and 
segment audio information for indexing, regardless of the input quality of the audio. 

As per claim 24, this claim recites limitations similar to those recited in claim 8, and is 
therefore rejected for similar reasons. 

As per claim 25, this claim recites limitations similar to those recited in claim 9, and is 
therefore rejected for similar reasons. 

As per claim 26, this claim recites limitations similar to those recited in claim 6, and is 
therefore rejected for similar reasons. 

As per claim 27, Colbath in view of Liu, and further in view of Stanford disclose the 
method of claim 23, and Colbath further discloses wherein the indexer further includes 
at least one of a speaker clustering component, a speaker identification component, a 
name spotting component, and a topic classification component (page 2, Component 
Technologies). 

As per claim 28, this claim recites limitations similar to those recited in claim 1, and is 
therefore rejected for similar reasons. 

As per claims 29 and 30, these claims recite limitations similar to those recited in claims 
4 and 5, and are therefore rejected for similar reasons. 



As per claim 31, this claim recites limitations similar to those recited in claim 8, and is 
therefore rejected for similar reasons. 

Conclusion 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Dorothy Sarah Siedler whose telephone number is 
571-270-1067. The examiner can normally be reached on Mon-Thur 9:30am-5:30pm. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Richemond Dorvil, can be reached on 571-272-7602. The fax phone 
number for the organization where this application or proceeding is assigned is 
571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 